Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] do not copy column-major numpy arrays when creating Dataset from list of arrays #6773

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented Jan 2, 2025

Following the suggestion in #6751 (comment) this changes the signature of the C API LGBM_DatasetCreateFromMats function to take an array of integers as is_row_major, indicating the layout of each matrix. Also updates the lightgbm.Dataset.__init_from_list_np2d method to provide this setting and avoid copying the arrays in the list.

@jmoralez
Copy link
Collaborator Author

jmoralez commented Jan 3, 2025

This is a breaking change in the C API, so I can maybe split this into two PRs, one making the change to the C API only and mark it as a breaking change, and then implement the logic in the python package and mark it as efficiency. WDYT @StrikerRUS @jameslamb?

@jmoralez jmoralez marked this pull request as ready for review January 3, 2025 16:38
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I think we should NOT break the ABI just for this purpose. What do you think about introducing a new function in the C API like LGBM_DatasetCreateFromMatsMixedLayouts()?

Most of the implementation of LGBM_DatasetCreateFromMats() could be moved into this function, and then LGBM_DatasetCreateFromMats() could internally construct a vector of is_row_major and call this new function.

You will probably have a better recommendation for the name, but what do you think about this general idea?

In my opinion it's worth growing the C API a little bit in exchange for not breaking other projects that link against lib_lightgbm. I found a few of those on GitHub (with this search):

And we can use the docs on the function to clearly describe what the difference is.

@jameslamb
Copy link
Collaborator

This is a breaking change in the C API, so I can maybe split this into two PRs, one making the change to the C API only and mark it as a breaking change, and then implement the logic in the python package and mark it as efficiency

Even if we do decide to keep this as a breaking change, I'd be ok with it being one PR. The changes are so tightly related to each other and the total size of code changes is fairly small. I think it's ok to have a single one under the breaking category. Also the others are under efficiency so people will understand from the release notes that lightgbm makes fewer copies for numpy input... think that's enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants